Goto

Collaborating Authors

 Gran Chaco


Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team, null, Keren, Gil, Kozhevnikov, Artyom, Meng, Yen, Ropers, Christophe, Setzler, Matthew, Wang, Skyler, Adebara, Ife, Auli, Michael, Balioglu, Can, Chan, Kevin, Cheng, Chierh, Chuang, Joe, Droof, Caley, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Erben, Alexander, Gao, Cynthia, Gonzalez, Gabriel Mejia, Lyu, Kehan, Miglani, Sagar, Pratap, Vineel, Sadagopan, Kaushik Ram, Saleem, Safiyyah, Turkatenko, Arina, Ventayol-Boada, Albert, Yong, Zheng-Xin, Chung, Yu-An, Maillard, Jean, Moritz, Rashel, Mourachko, Alexandre, Williamson, Mary, Yates, Shireen

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging a LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.


LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

Vujanic, Robin, Rueckstiess, Thomas

arXiv.org Artificial Intelligence

We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M parameters information retrieval oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is however not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and in contrast to other embedding model training frameworks, it does not require judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.


Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources

Ticona, Belu, Carranza, Fernando, Cotik, Viviana

arXiv.org Artificial Intelligence

Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tup\'i-Guaran\'i, Guaycur\'u, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.


Low-carbon milk to AI irrigation: tech startups powering Latin America's green revolution

The Guardian

Leo Prieto's passion for nature started during his childhood by the sea. "I was obsessed with what was under the surface. I'd anchor myself to a rock with my snorkel, and I was fascinated by all the little animals doing things that go unnoticed." His teenage years coincided with the arrival of the internet in Chile, where he became a web pioneer, launching and selling several startups. Inevitably, his interests in the environment, the internet and business merged, driven by the feeling that technological advances should not be wasted.


Mapping of Land Use and Land Cover (LULC) using EuroSAT and Transfer Learning

Kunwar, Suman, Ferdush, Jannatul

arXiv.org Artificial Intelligence

As the global population continues to expand, the demand for natural resources increases. Unfortunately, human activities account for 23% of greenhouse gas emissions. On a positive note, remote sensing technologies have emerged as a valuable tool in managing our environment. These technologies allow us to monitor land use, plan urban areas, and drive advancements in areas such as agriculture, climate change mitigation, disaster recovery, and environmental monitoring. Recent advances in AI, computer vision, and earth observation data have enabled unprecedented accuracy in land use mapping. By using transfer learning and fine-tuning with RGB bands, we achieved an impressive 99.19% accuracy in land use analysis. Such findings can be used to inform conservation and urban planning policies.


GlotLID: Language Identification for Low-Resource Languages

Kargaran, Amir Hossein, Imani, Ayyoob, Yvon, François, Schütze, Hinrich

arXiv.org Artificial Intelligence

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.


Multilingual Controllable Transformer-Based Lexical Simplification

Sheang, Kim Cheng, Saggion, Horacio

arXiv.org Artificial Intelligence

Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Therefore, suggesting simpler alternatives for complex words without compromising meaning would help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fined-tuned with the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. The evaluation results on three well-known LS datasets -- LexMTurk, BenchLS, and NNSEval -- show that our model outperforms the previous state-of-the-art models like LSBert and ConLS. Moreover, further evaluation of our approach on the part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively when compared with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Moreover, our model obtains performance gains also for Spanish and Portuguese.


Labeled Bipolar Argumentation Frameworks

Escañuela Gonzalez, Melisa G. (Conasejo Nacional de Investigaciones Científicas y Técnicas (CONICET) - Universidad Nacional de Santiago del Estero (UNSE)) | Budán, Maximiliano C. D. | Simari, Gerardo I. (Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) - Universidad Nacional del Sur (UNS)) | Simari, Guillermo R. (Universidad Nacional del Sur (UNS))

Journal of Artificial Intelligence Research

An essential part of argumentation-based reasoning is to identify arguments in favor and against a statement or query, select the acceptable ones, and then determine whether or not the original statement should be accepted. We present here an abstract framework that considers two independent forms of argument interaction--support and conflict--and is able to represent distinctive information associated with these arguments. This information can enable additional actions such as: (i) a more in-depth analysis of the relations between the arguments; (ii) a representation of the user's posture to help in focusing the argumentative process, optimizing the values of attributes associated with certain arguments; and (iii) an enhancement of the semantics taking advantage of the availability of richer information about argument acceptability. Thus, the classical semantic definitions are enhanced by analyzing a set of postulates they satisfy. Finally, a polynomial-time algorithm to perform the labeling process is introduced, in which the argument interactions are considered.


Bipolar in Temporal Argumentation Framework

Budán, Maximiliano C. D., Cobo, Maria Laura, Martinez, Diego C., Simari, Guillermo R.

arXiv.org Artificial Intelligence

A Timed Argumentation Framework (TAF) is a formalism where arguments are only valid for consideration in a given period of time, called availability intervals, which are defined for every individual argument. The original proposal is based on a single, abstract notion of attack between arguments that remains static and permanent in time. Thus, in general, when identifying the set of acceptable arguments, the outcome associated with a TAF will vary over time. In this work we introduce an extension of TAF adding the capability of modeling a support relation between arguments. In this sense, the resulting framework provides a suitable model for different time-dependent issues. Thus, the main contribution here is to provide an enhanced framework for modeling a positive (support) and negative (attack) interaction varying over time, which are relevant in many real-world situations. This leads to a Timed Bipolar Argumentation Framework (T-BAF), where classical argument extensions can be defined. The proposal aims at advancing in the integration of temporal argumentation in different application domain.


Dealing with Qualitative and Quantitative Features in Legal Domains

Budán, Maximiliano C. D., Cobo, María Laura, Martínez, Diego I., Rotolo, Antonino

arXiv.org Artificial Intelligence

In this work, we enrich a formalism for argumentation by including a formal characterization of features related to the knowledge, in order to capture proper reasoning in legal domains. We add meta-data information to the arguments in the form of labels representing quantitative and qualitative data about them. These labels are propagated through an argumentative graph according to the relations of support, conflict, and aggregation between arguments.